iT邦幫忙

2025 iThome 鐵人賽

DAY 24
AI & Data

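雲端情人 (Cloud Lover) - AI Love series, Part 24

Keeping the Service Up (Refactor): Health Checks, Monitoring, and the Three Stability Essentials (FastAPI × LINE SDK v3)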


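Day 23|Keeping the Service Up: Health Checks, Monitoring, and the Three Stability Essentials (FastAPI × LINE SDK v3)

Goal: stop relying on luck once the service is live. Today we add the following to the project:
1. Health checks (liveness / readiness)
2. Structured logging + Request ID correlation
3. The stability trio: timeouts, retries, circuit breaking / throttling
4. (Optional) Prometheus metrics, to see traffic and errors at a glance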

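  1. Health checks: /healthz and /readyz

Liveness tells the platform "I am still alive"; readiness says "I can take requests right now".
The same pattern works on Render, K8s, and plain cloud hosts.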

health.py

from fastapi import APIRouter
from fastapi.responses import PlainTextResponse
import httpx
import os

router = APIRouter()


@router.get("/healthz")
async def healthz():
    # Lightweight check: if the process can answer, it is alive
    return PlainTextResponse("ok", status_code=200)


@router.get("/readyz")
async def readyz():
    # Critical dependencies: env vars and external service reachability (keep it fast, always set a timeout)
    required_envs = ["CHANNEL_ACCESS_TOKEN", "CHANNEL_SECRET"]
    missing = [k for k in required_envs if not os.getenv(k)]
    if missing:
        return PlainTextResponse(f"env missing: {','.join(missing)}", status_code=503)

    try:
        # Quick reachability probe against the LINE API host (or any light probe you prefer).
        # No need to call an authenticated API; a public endpoint or a DNS lookup works too.
        # Any HTTP response, even a 4xx, proves the host is reachable.
        async with httpx.AsyncClient(timeout=2) as c:
            await c.get("https://api.line.me/")
    except Exception as e:
        return PlainTextResponse(f"dep: line_api {e}", status_code=503)

    return PlainTextResponse("ready", status_code=200)

In app_fastapi.py:

from health import router as health_router
app.include_router(health_router)

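Benefits
• The platform can automatically restart a "hung" instance (when healthz fails).
• During a deploy, the instance receives traffic only after readiness reports OK, so events are not dropped during cold start.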
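
To make the second benefit concrete, here is a minimal sketch of what a platform or deploy script does with a readiness probe: it polls until the probe reports ready, and only then routes traffic. `wait_until_ready` and its `probe` callable are hypothetical names used for illustration, not part of FastAPI or any platform API. In practice `probe` would send a GET to /readyz and return True on HTTP 200.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    `probe` is a hypothetical stand-in for "GET /readyz returned 200".
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # a probe that raises counts as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated instance: not ready for the first two probes, then ready.
    answers = iter([False, False, True])
    print(wait_until_ready(lambda: next(answers), timeout=5, interval=0.01))
```

Treating a raised exception as "not ready" mirrors how the /readyz handler above turns any dependency error into a 503.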

  2. Structured logging + Request ID

Turn logs into searchable JSON, and use a Request ID to tie together one whole call chain (webhook → your handler → external API).

2.1 Middleware that adds a Request ID

logging_setup.py

import json
import logging
import uuid
from contextvars import ContextVar
from typing import Callable

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

# Holds the current request's ID; "-" means "outside any request"
request_id_ctx: ContextVar[str] = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
            "request_id": request_id_ctx.get("-"),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def setup_json_logging():
    h = logging.StreamHandler()
    h.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = []  # replace any existing root handlers
    root.addHandler(h)
    root.setLevel(logging.INFO)


class RequestIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Callable):
        # Reuse the caller's ID if present, otherwise generate one
        rid = request.headers.get("x-request-id") or str(uuid.uuid4())
        token = request_id_ctx.set(rid)
        try:
            resp: Response = await call_next(request)
            resp.headers["x-request-id"] = rid
            return resp
        finally:
            request_id_ctx.reset(token)

Enable it:

app_fastapi.py

from logging_setup import setup_json_logging, RequestIdMiddleware
setup_json_logging()
app.add_middleware(RequestIdMiddleware)

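Usage:

import logging
# Use an application logger that propagates to the root JSON handler.
# Under uvicorn's default logging config, "uvicorn.error" is handled by
# uvicorn's own plain-text handler and would bypass the JSON formatter.
log = logging.getLogger("app")
log.info("handle text event")  # request_id is attached automatically

When troubleshooting, search the log panel for a given request_id to see the whole flow of that request.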
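
The correlation mechanism can be tried without a web server. The sketch below is a standalone, stdlib-only reduction of the formatter above: it sets the context variable by hand (the job the middleware does per request) and shows that each log line picks up the current ID. The logger name `demo.request_id` and the helpers `build_logger` and `log_with_request_id` are hypothetical, introduced only for this demo.

```python
import io
import json
import logging
from contextvars import ContextVar

request_id_ctx: ContextVar[str] = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "msg": record.getMessage(),
                "logger": record.name,
                "request_id": request_id_ctx.get(),
            },
            ensure_ascii=False,
        )


def build_logger(stream: io.StringIO) -> logging.Logger:
    """Return a logger that writes JSON lines into `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("demo.request_id")  # hypothetical name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output in `stream` only
    return logger


def log_with_request_id(logger: logging.Logger, rid: str, message: str) -> None:
    """Mimic the middleware: set the ID, log, then restore the old value."""
    token = request_id_ctx.set(rid)
    try:
        logger.info(message)
    finally:
        request_id_ctx.reset(token)


if __name__ == "__main__":
    buf = io.StringIO()
    log = build_logger(buf)
    log_with_request_id(log, "req-123", "handle text event")
    log.info("outside any request")  # falls back to the default "-"
    print(buf.getvalue(), end="")
```

Because the ID lives in a ContextVar rather than a global, concurrent requests handled by separate asyncio tasks each see their own value.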

  3. The stability trio: timeout, retry, circuit breaking / throttling

3.1 A lightweight general-purpose httpx wrapper (timeout + retry)

http_helpers.py

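import asyncio
import random

import httpx


async def fetch_json(url: str, headers=None, timeout=4.0, retries=2, backoff=(0.2, 0.8)):
    """GET `url` and return the parsed JSON, retrying up to `retries` times."""
    last = None
    # Reuse one client across attempts so connections can be pooled
    async with httpx.AsyncClient(timeout=timeout) as c:
        for i in range(retries + 1):
            try:
                r = await c.get(url, headers=headers)
                r.raise_for_status()
                return r.json()
            except Exception as e:
                last = e
                if i < retries:
                    # Random pause between attempts so retries do not fire in lockstep
                    await asyncio.sleep(random.uniform(*backoff))
    raise last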

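Usage (for example, an exchange rate API):

from http_helpers import fetch_json


async def get_twd_per(target="JPY"):
    # The base currency is `target`, so rates["TWD"] is TWD per one unit of `target`
    data = await fetch_json(f"https://open.er-api.com/v6/latest/{target}")
    return data["rates"]["TWD"]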
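
fetch_json above sleeps a random 0.2 to 0.8 seconds before every retry, regardless of how many attempts have already failed. A common alternative, shown here as a swapped-in variant rather than what the article's code does, is exponential backoff with full jitter: the upper bound of the wait doubles on each attempt up to a cap. `backoff_delays` and its parameters are hypothetical names for this sketch.

```python
import random
from typing import List, Optional


def backoff_delays(
    retries: int,
    base: float = 0.2,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one sleep duration per retry using exponential backoff with full jitter.

    For retry number n (starting at 0) the delay is drawn uniformly from
    [0, min(cap, base * 2**n)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Ceilings for 4 retries with base=0.2 and cap=1.0 are 0.2, 0.4, 0.8, 1.0
    for n, delay in enumerate(backoff_delays(4, base=0.2, cap=1.0)):
        print(f"retry {n}: sleep {delay:.3f}s")
```

To use it with a retry loop like the one in fetch_json, one would compute the list once and sleep for `delays[i]` after the i-th failed attempt.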

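3.2 Circuit breaking & throttling (ultra-minimal version)

This keeps you from hammering an external service that is already down, and keeps a single user from flooding a group chat.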

circuit.py

import time

FAIL_MAX = 5    # consecutive failures before the circuit opens
COOL_DOWN = 30  # seconds to stay open
_state = {"fails": 0, "until": 0}


def can_call() -> bool:
    # Calls are allowed once the cool-down deadline has passed
    return time.time() >= _state["until"]


def record(success: bool):
    if success:
        _state["fails"] = 0
        _state["until"] = 0
    else:
        _state["fails"] += 1
        if _state["fails"] >= FAIL_MAX:
            _state["until"] = time.time() + COOL_DOWN

When making the call:

from circuit import can_call, record

# Inside your handler:
if not can_call():
    return "External service is busy, please try again later 🙏"
try:
    # do external call...
    record(True)
except Exception:
    record(False)
    raise
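
The module-level state above is hard to test because it depends on the real clock. The following sketch wraps the same logic in a class with an injectable clock so the open and cool-down behaviour can be stepped through deterministically. `CircuitBreaker` is a hypothetical name for this illustration; the thresholds mirror FAIL_MAX and COOL_DOWN.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Same logic as circuit.py, with per-instance state and an injectable clock."""

    def __init__(
        self,
        fail_max: int = 5,
        cool_down: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fail_max = fail_max
        self.cool_down = cool_down
        self._clock = clock
        self._fails = 0
        self._open_until = 0.0

    def can_call(self) -> bool:
        # Closed, or the cool-down has elapsed and a trial call is allowed
        return self._clock() >= self._open_until

    def record(self, success: bool) -> None:
        if success:
            self._fails = 0
            self._open_until = 0.0
            return
        self._fails += 1
        if self._fails >= self.fail_max:
            # Open (or re-open) the circuit for another cool-down period
            self._open_until = self._clock() + self.cool_down


if __name__ == "__main__":
    now = [100.0]  # fake clock, advanced by hand
    cb = CircuitBreaker(fail_max=3, cool_down=30, clock=lambda: now[0])
    for _ in range(3):
        cb.record(False)
    print("after 3 failures:", cb.can_call())
    now[0] += 31
    print("after cool-down:", cb.can_call())
    cb.record(False)  # the trial call fails, so the circuit re-opens at once
    print("after failed trial:", cb.can_call())
```

Note that the failure counter is not reset when the cool-down ends, so a single failed trial call re-opens the circuit immediately; only a success closes it again.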

User throttling (simple per-chat rate limiting):

throttle.py

import time

WINDOW = 5     # seconds
MAX_REQ = 8    # adjust as needed
_buckets = {}  # chat_id -> [ts1, ts2, ...]


def allow(chat_id: str) -> bool:
    now = time.time()
    q = _buckets.setdefault(chat_id, [])
    # Drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.pop(0)
    if len(q) >= MAX_REQ:
        return False
    q.append(now)
    return True

At the very start of the text event handler:

from throttle import allow

if not allow(chat_id):
    await reply("Hang on a moment~ let me catch my breath 😮‍💨")
    return
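
The sliding-window idea can be exercised with a fake clock. This is a sketch under the same WINDOW and MAX_REQ defaults as throttle.py, using a deque so expired timestamps are dropped from the left in constant time. `SlidingWindowThrottle` is a hypothetical name, not part of the project.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowThrottle:
    """Allow at most `max_req` requests per chat within any `window` seconds."""

    def __init__(
        self,
        window: float = 5.0,
        max_req: int = 8,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window = window
        self.max_req = max_req
        self._clock = clock
        self._buckets: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, chat_id: str) -> bool:
        now = self._clock()
        q = self._buckets[chat_id]
        # Drop timestamps older than the window
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_req:
            return False
        q.append(now)
        return True


if __name__ == "__main__":
    now = [0.0]  # fake clock, advanced by hand
    throttle = SlidingWindowThrottle(clock=lambda: now[0])
    burst = [throttle.allow("group-1") for _ in range(10)]
    print("burst of 10 at t=0:", burst.count(True), "allowed")
    now[0] = 5.1
    print("after the window passes:", throttle.allow("group-1"))
```

Buckets are kept per chat_id, so one noisy group does not consume the quota of another.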

  4. (Optional) Prometheus metrics

Very handy for coarse-grained monitoring: QPS, error rate, and which features are used most.

metrics.py

from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi import APIRouter
from fastapi.responses import Response
import time

router = APIRouter()

EVENTS_TOTAL = Counter("events_total", "Total LINE events", ["type"])
ERRORS_TOTAL = Counter("errors_total", "Total errors", ["where"])
LATENCY = Histogram("handler_latency_seconds", "Handler latency", ["name"])


@router.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

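Usage example (add this to metrics.py, since it relies on time and LATENCY defined there):

from contextlib import contextmanager


@contextmanager
def observe(name: str):
    # Record how long the wrapped block takes, even if it raises
    start = time.time()
    try:
        yield
    finally:
        LATENCY.labels(name=name).observe(time.time() - start)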

In the event handler:

from metrics import router as metrics_router, EVENTS_TOTAL, ERRORS_TOTAL, observe
app.include_router(metrics_router)

EVENTS_TOTAL.labels(type="text").inc()
with observe("text_handler"):
    ...  # your handling logic

  5. Exception handling: uniform responses, clearer logs

from fastapi import Request
from fastapi.responses import JSONResponse
import logging

log = logging.getLogger("uvicorn.error")


@app.exception_handler(Exception)
async def global_exc(request: Request, exc: Exception):
    # Log the full traceback, but return only a generic message to the client
    log.error("unhandled", exc_info=exc)
    return JSONResponse({"error": "internal error"}, status_code=500)

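  6. Pre-deployment checklist
    • /healthz returns 200 and /readyz returns 200 (a delay of 1–2 seconds during cold start is tolerable)
    • /metrics exposes events_total and errors_total
    • Logs are JSON and carry the request ID (x-request-id)
    • Every external call has a timeout (≤ 5s) and retries (≤ 2)
    • Circuit breaking / throttling keeps the service from "suicidally hammering" an external dependency that is down
    • A stress test in a group chat (50 text messages within 10 seconds) still gets normal replies or a graceful back-off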

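  7. Debugging tips (from practice)
    • Can't find a particular reply?
    Take the request_id from the webhook log and search the application log with it; check whether the request stalled on an external call (the latency histogram can confirm this).
    • Translation mode suddenly stopped working?
    Search the logs for translation_states state changes; most often the chat_id changed (a different group or room) or the error was swallowed by the global exception handler.
    • Occasional 503s?
    Check whether the /readyz probe is too strict (for example, calling a slow API on every probe); switching the readiness probe to a lightweight check fixes it.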

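Summary

Today we filled in observability and resilience in one go:
• /healthz and /readyz make deployment and automatic recovery more reliable
• JSON logs + Request ID make troubleshooting efficient
• Timeouts + retries + circuit breaking / throttling hold up when external services are unstable
• Prometheus metrics make peaks, errors, and popular features quick to read

The service now has the basics of "production grade" in place.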

